Informedia @ TRECVID 2011

Authors

  • Lei Bao
  • Zhen-Zhong Lan
  • Arnold Overwijk
  • Qin Jin
  • Shoou-I Yu
  • Brian Langner
  • Michael Garbus
  • Susanne Burger
  • Florian Metze
  • Alexander Hauptmann
Abstract

The Informedia group participated in three tasks this year: Multimedia Event Detection (MED), Semantic Indexing (SIN), and Surveillance Event Detection (SED). All three tasks consist of the same three main steps: feature extraction, detector training, and fusion. In the feature extraction step, we extracted a wide range of low-level features, high-level features, and text features. In particular, we used the Spatial-Pyramid Matching technique to represent low-level visual local features such as SIFT and MoSIFT, capturing the location information of the feature points. In the detector training step, besides the traditional SVM, we proposed a Sequential Boosting SVM classifier to deal with the large-scale imbalanced classification problem. In the fusion step, to exploit the complementary strengths of different features, we tried three fusion methods: early fusion, late fusion, and double fusion, where double fusion is a combination of early fusion and late fusion. The experimental results demonstrate that double fusion is consistently better than, or at least comparable to, early fusion and late fusion.

1 Multimedia Event Detection (MED)

1.1 Feature Extraction

In order to encompass all aspects of a video, we extracted a wide variety of visual and audio features, as shown in Table 1.

Table 1: Features used for the MED task.

|                     | Visual Features                                                                          | Audio Features                      |
|---------------------|------------------------------------------------------------------------------------------|-------------------------------------|
| Low-level Features  | SIFT [19], Color SIFT [19], Transformed Color Histogram [19], Motion SIFT [3], STIP [9] | Mel-Frequency Cepstral Coefficients |
| High-level Features | PittPatt Face Detection [12], Semantic Indexing Concepts [15]                            | Acoustic Scene Analysis             |
| Text Features       | Optical Character Recognition                                                            | Automatic Speech Recognition        |

1.1.1 SIFT, Color SIFT (CSIFT), Transformed Color Histogram (TCH)

These three features describe the gradient and color information of a static image. We used the Harris-Laplace detector for corner detection; for more details, please see [19]. Instead of extracting features from every frame of every video, we first run shot-break detection and extract features only from the keyframe of each shot. The shot-break detection algorithm computes the color histogram difference between adjacent frames, and a shot boundary is declared wherever this difference exceeds a threshold. For the 16,507 training videos we extracted 572,881 keyframes, and for the 32,061 testing videos we extracted 1,035,412 keyframes. Once we have the keyframes, we extract the three features with the executable provided by [19]. Given the raw feature files, a 4096-word codebook is built using the K-Means clustering algorithm. With this codebook, any region of an image can be represented as a 4096-dimensional bag-of-words vector. Using the Spatial-Pyramid Matching [10] technique, we extract 8 regions from each keyframe image and compute a bag-of-words vector for each region, giving an 8 × 4096 = 32768-dimensional vector per keyframe. The 8 regions are constructed as follows:

  • The whole image as one region.
  • Split the image into 4 quadrants; each quadrant is a region.
  • Split the image horizontally into 3 equally sized rectangles; each rectangle is a region.

Since each feature vector describes a single keyframe while a video is described by many keyframes, we represent a whole video by averaging the feature vectors of its keyframes. The resulting features are then provided to a classifier for classification. (Illustrative sketches of the shot-break detection and the spatial-pyramid representation follow.)
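As a minimal sketch of the threshold-based shot-break detection described above (using OpenCV, with an assumed Bhattacharyya metric and an illustrative threshold rather than the authors' actual settings):

```python
import cv2

def detect_shot_boundaries(video_path, threshold=0.4):
    """Return frame indices where the color-histogram difference between
    adjacent frames exceeds `threshold` (metric and value are assumptions)."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # 8x8x8-bin color histogram over the B, G, R channels, L1-normalized.
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hist = cv2.normalize(hist, hist, 1.0, 0.0, cv2.NORM_L1)
        if prev_hist is not None:
            # Bhattacharyya distance: 0 = identical histograms, 1 = disjoint.
            dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if dist > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```

A keyframe can then be taken from each detected shot, for example its middle frame.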
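The spatial-pyramid bag-of-words construction can be sketched as follows, assuming local descriptors and their (x, y) keypoint locations have already been extracted and a 4096-word K-Means codebook has been trained; the function and variable names here are illustrative, not from the authors' code.

```python
import numpy as np
from scipy.cluster.vq import vq  # nearest-codeword assignment

def spatial_pyramid_bow(points, descriptors, codebook, width, height):
    """Build the 8 x 4096-dimensional spatial-pyramid bag-of-words vector for
    one keyframe. `points` is an (N, 2) array of keypoint (x, y) locations,
    `descriptors` is (N, D), and `codebook` holds the (4096, D) centroids."""
    words, _ = vq(descriptors, codebook)  # codeword index per descriptor
    k = len(codebook)
    x, y = points[:, 0], points[:, 1]

    # Region masks: whole image, 4 quadrants, 3 horizontal strips (8 total).
    masks = [np.ones(len(points), dtype=bool),
             (x < width / 2) & (y < height / 2),
             (x >= width / 2) & (y < height / 2),
             (x < width / 2) & (y >= height / 2),
             (x >= width / 2) & (y >= height / 2)]
    masks += [(y >= i * height / 3) & (y < (i + 1) * height / 3)
              for i in range(3)]

    # One L1-normalized codeword histogram per region, concatenated.
    chunks = []
    for m in masks:
        hist = np.bincount(words[m], minlength=k).astype(np.float64)
        chunks.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(chunks)  # shape (8 * 4096,)

# Video-level feature: the mean over the keyframe vectors, as described above.
# video_vec = np.mean([spatial_pyramid_bow(...) for kf in keyframes], axis=0)
```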
1.1.2 Motion SIFT (MoSIFT)

Motion SIFT [3] is a motion-based feature that combines information from SIFT and optical flow. The algorithm first extracts SIFT points and, for each SIFT point, checks whether there is a sufficiently large optical flow near the point. If the optical flow exceeds a threshold, a 256-dimensional feature is computed for that point: the first 128 dimensions are the SIFT descriptor, and the remaining 128 dimensions describe the optical flow around the point. We extracted Motion SIFT by computing the optical flow between neighboring frames, but for speed we only extracted it on every third frame. Once we have the raw features, a 4096-word codebook is computed and, following the same process as for SIFT, a 32768-dimensional vector is created for classification. (A rough sketch of the flow-gated descriptor appears after Section 1.1.4.)

1.1.3 Space-Time Interest Points (STIP)

Space-Time Interest Points are computed using the code from [9]. Given the raw features, a 4096-word codebook is computed and, following the same process as for SIFT, a 32768-dimensional vector is created for classification.

1.1.4 Semantic Indexing (SIN)

We predicted the 346 semantic concepts from the Semantic Indexing 2011 task on the MED keyframes. For details on how we built the models for the 346 concepts, please refer to Section 2. Once we have the prediction scores of each concept on each keyframe, we compute a 346-dimensional feature representing the video, where the value of each dimension is the mean of that concept's prediction scores over all keyframes in the video. We tried several score merging techniques, including mean and max, and mean had the best performance. These features are then provided to a classifier for classification (see the pooling sketch below).
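As referenced in Section 1.1.2, here is a rough, heavily simplified sketch of a MoSIFT-style flow-gated descriptor. It is not the authors' implementation: the Farneback flow, the threshold value, and the crude 128-dimensional flow pooling are all stand-in assumptions (the real method uses a SIFT-like grid aggregation of the flow field).

```python
import cv2
import numpy as np

def mosift_like_descriptors(frame_a, frame_b, flow_threshold=1.0):
    """Keep a SIFT point only if there is enough optical flow near it, and
    concatenate its 128-dim SIFT descriptor with a 128-dim flow descriptor."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

    sift = cv2.SIFT_create()
    keypoints, sift_desc = sift.detectAndCompute(gray_a, None)
    if sift_desc is None:
        return np.empty((0, 256))

    # Dense optical flow between the two frames (Farneback as a stand-in).
    flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)

    features = []
    for kp, desc in zip(keypoints, sift_desc):
        x, y = int(kp.pt[0]), int(kp.pt[1])
        patch = magnitude[max(0, y - 8):y + 8, max(0, x - 8):x + 8]
        if patch.size == 0 or patch.max() < flow_threshold:
            continue  # no significant motion near this point
        # Crude stand-in for the 128-dim flow descriptor of the real method.
        flow_desc = np.resize(patch, 128)
        features.append(np.concatenate([desc, flow_desc]))  # 256 dims total
    return np.array(features)
```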
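And a small sketch of the score pooling compared in Section 1.1.4, turning per-keyframe concept predictions into a single 346-dimensional video-level feature; the array shapes follow the description above, and the data here is random, for illustration only.

```python
import numpy as np

def pool_concept_scores(keyframe_scores, method="mean"):
    """Pool per-keyframe semantic-concept scores into one video-level feature.
    `keyframe_scores` has shape (num_keyframes, 346): one prediction score per
    SIN concept per keyframe. The paper reports mean pooling performed best."""
    if method == "mean":
        return keyframe_scores.mean(axis=0)
    if method == "max":
        return keyframe_scores.max(axis=0)
    raise ValueError(f"unknown pooling method: {method}")

# Example: a video with 25 keyframes and 346 concept detectors.
scores = np.random.rand(25, 346)
video_feature = pool_concept_scores(scores, "mean")  # shape (346,)
```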


Similar resources

Examining User Interactions with Video Retrieval Systems

The Informedia group at Carnegie Mellon University has since 1994 been developing and evaluating surrogates, summary interfaces, and visualizations for accessing digital video collections containing thousands of documents, millions of shots, and terabytes of data. This paper reports on TRECVID 2005 and 2006 interactive search tasks conducted with the Informedia system by users having no knowled...


Informedia@TRECVID 2011: Surveillance Event Detection

This paper presents a generic event detection system evaluated in the Surveillance Event Detection (SED) task of the TRECVID 2011 campaign. We investigate a generic statistical approach with spatio-temporal features applied to seven event classes, which were defined by the SED task. This approach is based on local spatio-temporal descriptors, named MoSIFT, generated by pair-wise vide...


Mining Novice User Activity with TRECVID Interactive Retrieval Tasks

This paper investigates the applicability of Informedia shot-based interface features for video retrieval in the hands of novice users, noted in past work as being too reliant on text search. The Informedia interface was redesigned to better promote the availability of additional video access mechanisms, and tested with TRECVID 2005 interactive search tasks. A transaction log analysis from 24 n...


Lessons for the Future from a Decade of Informedia Video Analysis Research

The overarching goal of the Informedia Digital Video Library project has been to achieve machine understanding of video media, including all aspects of search, retrieval, visualization and summarization in both contemporaneous and archival content collections. The base technology developed by the Informedia project combines speech, image and natural language understanding to automatically trans...


Summarizing BBC Rushes the Informedia Way

For the first time in 2007, TRECVID considered structured evaluation of automated video summarization, utilizing BBC rushes video. This paper discusses in detail our approaches for producing the submitted summaries to TRECVID, including the two baseline methods. The cluster method performed well in terms of coverage, and adequately in terms of user satisfaction, but did take longer to review. W...


Informedia @ TRECVID2008: Exploring New Frontiers

The Informedia team participated in the tasks of Rushes summarization, high-level feature extraction and event detection in surveillance video. For the rushes summarization, our basic idea was to use subsampled video at the appropriate rate, showing almost the whole video faster, and then modify the result to remove garbage frames. Simply subsampling the frames proved to be the best method for ...

